Combining Multiple Cues for Visual Madlibs Question Answering

https://doi.org/10.1007/s11263-018-1096-0

Tommasi, Tatiana; Mallya, Arun; Plummer, Bryan; Lazebnik, Svetlana; Berg, Alexander C.; Berg, Tamara L. (April 2018, International Journal of Computer Vision)

This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions from a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.
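
As a rough illustration of the pipeline the abstract describes, the sketch below fits a CCA embedding between image-cue features and answer features, scores candidate answers by cosine similarity in the joint space, and combines per-cue scores with a weighted sum. It is not the authors' implementation. The function names (fit_cca, embed, score_answers, combine), the regularizer, the eigenvalue scaling power, the fixed combination weights, and the synthetic data are all assumptions made for this example; in the paper the combination is learned by solving an optimization problem, which is not reproduced here.

```python
import numpy as np

def inv_sqrt(C):
    # Inverse matrix square root of a symmetric positive-definite matrix.
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def fit_cca(X, Y, dim, reg=1e-3):
    # Plain regularized CCA via whitening followed by an SVD.
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return {"mx": mx, "my": my, "A": Wx @ U[:, :dim],
            "B": Wy @ Vt[:dim].T, "corr": s[:dim]}

def embed(Z, mean, W, corr, power):
    # Project, scale each dimension by its canonical correlation raised to
    # `power` (an assumed hyperparameter), then L2-normalize the rows.
    E = ((Z - mean) @ W) * corr ** power
    return E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)

def score_answers(model, image_feat, answer_feats, power=4.0):
    # Cosine similarity between one image feature and each candidate answer.
    img = embed(image_feat[None, :], model["mx"], model["A"], model["corr"], power)
    ans = embed(answer_feats, model["my"], model["B"], model["corr"], power)
    return (ans @ img.T).ravel()

def combine(cue_scores, weights):
    # Weighted sum of per-cue score vectors. The weights are fixed here;
    # learning them is the step this sketch leaves out.
    return np.asarray(weights) @ np.vstack(cue_scores)

# Synthetic stand-ins for image-cue features (X) and answer features (Y)
# that share a low-dimensional latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(600, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(600, 20))
Y = latent @ rng.normal(size=(5, 15)) + 0.1 * rng.normal(size=(600, 15))

model = fit_cca(X[:500], Y[:500], dim=5)
candidates = Y[500:504]  # four candidates; the first is paired with X[500]
scores = score_answers(model, X[500], candidates)
final = combine([scores, scores], weights=[0.5, 0.5])
predicted = int(np.argmax(final))
```

With several cues, one model would be fitted per cue (for example scene, activity, or attribute features) and the resulting score vectors passed to combine; the duplicated cue above only keeps the example short.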

